Table of content

1. Introduction
1.1 Data loading
1.2 Data scaling

2. Broad analysis
2.1 Boxplots
2.2 Densityplot
2.3 K-means clustering
2.4 PCA

3. Specific analysis: lapatinib
3.1 Analysis of biomarker and t-test between treated and untreated cells
3.1.1 Expression of biomakers before and after the treatment of Lapatinib
3.1.2 Searching for our own biomarkers
3.1.3 Boxplot of our biomarkers
3.1.4 ggBoxplot of our biomarkers’ basalexpression in breast cells with Kruskal-Wallis test
3.1.5 Heatmap of biomarkers in breast cells
3.2. Paired t-test between the Lapatinib treated und untreated data
3.2.1 Genome wide paired t-test
3.2.2 Paired t-test of our biomarker before and after treatment
3.2.3 Paired t-test over genes, increased confidence level (99 %)
3.2.4 Visualization of the genome-wide paired t-test in a vulcano plot
3.3 K-means
3.4 PCA
3.5 Spearman correlation

4. Main questions
4.1 Question 1: Does the cell doubling time correlate with reduced drug sensitivity?
4.1.1 Linear regression ~ Inoculation_Density
4.1.2 Linear regression ~ Doubling-time
4.1.3 Multiple regression
4.1.4 Table of conclusion
4.2 Question 2: Does Lapatinib have a similar effect on lung cancer as Erlotinib?
4.2.1 Heatmap
4.2.2 Plot Erlotinib and Lapatinib, coloured by tissue
4.2.3 Pearson correlation
4.2.4 Lung cellines
4.2.5 Anova
4.3 Question 3: Does Lapatinib have a similar effect on brain metastasis as it does on breast cancer?
4.3.1 Volcano plot
4.3.2 Heatmap for visualization
4.3.3 Investigation of correlation

1. Introduction

Our project is about lapatinib and divided into three parts: Broad analysis, specific analysis and our three main questions. The main questions are the most important part of our project. They address whether cell doubling time correlates with reduced drug sensitivity and the effect of lapatinib on lung cancer and brain metastasis. Twenty-five per cent of breast cancers overexpress ErbB2 / HER which leads to a more aggressive phenotype. Lapatinib blocks tyrosine kinases that belong to the HER2 receptor, which is increasingly found on the cell surface of cancer cells. HER2 is a receptor for the growth factor epidermal growth factor (EGF), which stimulates the cell division of cancer cells. By blocking these HER2 receptors, the unregulated growth of cancer cells can be controlled again. Lapatinib binds the ATP-binding site of the receptor’s intracellular domain and inhibits the survival and proliferation pathways of the tumor cell.

1.1 Data loading

1.2 Data scaling

After checking for normalization, we scaled our data in the first place to provide the scaled data for further analysis.

list = list(Treated,Untreated)
nlist = lapply(list,scale)
Treated = as.data.frame(nlist[[1]])
Untreated = as.data.frame(nlist[[2]])
Fold_Change = Treated - Untreated
Fold_Change = data.frame(Fold_Change)
rm(NCI_TPW_gep_treated,NCI_TPW_gep_untreated,list,nlist)

Remove missing values

n= as.data.frame(t(NegLogGI50))
rmv.rows = apply(n, 1, function(x) {
  sum(is.na(x))
})
NLGI50.all = n[-which(rmv.rows > 0), ]  # Removing any row with 1 or more missing values
rm(rmv.rows, n)
# used for second main question 

2. Broad analysis

2.1 Boxplots

As a very first step of our data analysis, we checked the for normalization.

To check, if there is a connection between the different drugs, we coloured the plot concerning the drugs.

As shown, there is a connection between the different “boxes” and the different drugs. This could be the reason for different batches so we scale the data.

As you can see the scaling is necessary to compensate the batch effect. Scaling is then applied on Untreated as well before further analysis.

2.2 Densityplot

As a other method to get in touch with the data, we checked if there is a difference between the treated and untreated genexpression. The abline shows the 3 quantiles ( 25 % 50 % 75 % )

As assumed, the expression patterns don’t differentiate much.

2.3 K-means clustering

To look for clusters in the raw data we performed a k-menas clustering and searched for potentially clusters using the most variable (>75% Var), thus informative genes.

Running a loop for the best cluster-amount (searching for an “ellbow”)

As we wanted an “ellbow”, it is hard to determine the amount of clusters.

2.4 PCA

Since the clustering method by performing k means clustering was not successful, we performed a principal component analysis for dimensionality reduction as another method for identifying clusters or patterns while preserving as much information of the original data set as possible.

pca <- prcomp(t(Fold_Change), scale = TRUE)

By plotting a scree plot of the variation percentage, we can display how much variation a principal component captures from the original data.

pca.var <- pca$sdev^2  # sdev calculates variation each PC accounts for

pca.var.per <- round(pca.var/sum(pca.var)*100, 1) # since percentages make more sense than normal variation values calculate % of variation

plot(pca.var.per[1:10], main = "Scree plot", type = "b", xlab = "Principal Components", ylab = "% variation")

plot(cumsum(pca.var.per[1:15]), main = "cumulative variation", type = "l", xlab = "Principal Components", ylab = "% variation") 

As seen in the scree plot, the first three or four PCs have capture most of the information since the curve bends at an “elbow” at PC4 - this is our cutting off point.

PCs describe variation and account for the varied influences of the original characteristics. Such influences, or loadings, can be traced back from the PCA to find out which genes contribute most to the single PCS.

## get names of top 10 genes that contribute most to pc1
loading_scores_1 <- pca$rotation[,1]
gene_score <- abs(loading_scores_1) ## sort magnitude
gene_score_ranked <- sort(gene_score, decreasing = TRUE)


top_10_genes <- names(gene_score_ranked[1:10])
top_10_genes # show names of top 10 genes
##  [1] "DNAJC2"  "NGDN"    "GTPBP4"  "CCDC59"  "DNTTIP2" "AKAP8"   "PAPSS1" 
##  [8] "TRMT1"   "BRF2"    "YRDC"

To determine clusters of samples based on their similarity, we plotted PC1 and PC2 aswell as a PC2 and PC3 while colouring by tissue type and drug respectively.

par(mfrow=c(1,2), cex=0.7)
#par(mar=c(4, 6, 3, 3))

## colored by drug
#plot PC1 and PC2
plot(pca$x[,1], 
     pca$x[,2], 
     col = color_drug,
     pch = 19, 
     xlab = paste("PC1 (",pca.var.per[1],"%)"), 
     ylab = paste("PC2 (",pca.var.per[2],"%)"))
#create legend
legend("topleft", 
       legend = drug, 
       col = magma, 
       pch = 19, 
       xpd = "TRUE",
       bty = "n",
       cex = 0.75
)
#create title
mtext("PCA of Fold Change  colored by drug", 
      side = 3, 
      line = -2,
      cex = 1.2,
      font = 2, 
      outer = TRUE)

#plot PC2 and PC3
plot(pca$x[,2], 
     pca$x[,3], 
     col = color_drug,
     pch = 19, 
     xlab = paste("PC2 (",pca.var.per[2],"%)"), 
     ylab = paste("PC3 (",pca.var.per[3],"%)"))
#create legend
legend("right", 
       legend = drug, 
       col = magma, 
       pch = 19, 
       xpd = "TRUE",
       bty = "n",
       cex = 0.75,
       inset = c(0, 2)
)
#create title
mtext("PCA of Fold Change  colored by drug", 
      side = 3, 
      line = -2,
      cex = 1.2,
      font = 2, 
      outer = TRUE)

## colored by tissue
#plot PC1 and PC2
par(mfrow=c(1,2), cex=0.7)

plot(pca$x[,1], 
     pca$x[,2], 
     col = color_tissue,
     pch = 19, 
     xlab = paste("PC1 (",pca.var.per[1],"%)"), 
     ylab = paste("PC2 (",pca.var.per[2],"%)"))
#create legend
legend("topleft", 
       legend = tissue, 
       col = viridis, 
       pch = 19, 
       xpd = "TRUE",
       bty = "n",
       cex = 0.75
)
#create title
mtext("PCA of Fold Change  colored by tissue", 
      side = 3, 
      line = -2,
      cex = 1.2,
      font = 2, 
      outer = TRUE)

#plot PC2 and PC3
plot(pca$x[,2], 
     pca$x[,3], 
     col = color_tissue,
     pch = 19, 
     xlab = paste("PC2 (",pca.var.per[2],"%)"), 
     ylab = paste("PC3 (",pca.var.per[3],"%)"))
#create legend
legend("right", 
       legend = tissue, 
       col = viridis, 
       pch = 19, 
       xpd = "TRUE",
       bty = "n",
       cex = 0.75,
       inset = c(0, 2)
)
#create title
mtext("PCA of Fold Change  colored by tissue", 
      side = 3, 
      line = -2,
      cex = 1.2,
      font = 2,
      outer = TRUE)

As we can see, in contrast to the k means method, clusters can be determined. Respecting the colour annotation, it is clear to say that the samples cluster according to drug treatment.

3. Specific Analysis: Lapatinib

3.1 Analysis of biomarker and t-test between treated and untreated cells

3.1.1 Expression of biomakers before and after the treatment of Lapatinib

Treated_t<-data.frame(t(Treated))
Untreated_t<-data.frame(t(Untreated))
LapatinibUntreated<-dplyr::select(Untreated, contains('Lapa'))
LapatinibTreated<-dplyr::select(Treated, contains('Lapa'))
before=as.matrix(LapatinibUntreated)
after=as.matrix(LapatinibTreated)
LapUn<-data.frame(t(before))
LapTreat<-data.frame(t(after))
We found these genes as biomarkers in a paper and presented them in our first presentation. Unfortunately, we can not use them as a biomarkers. (Source of the paper: Basal-Like Breast Cancer Defined by Five Biomarkers Has Superior Prognostic Value than Triple-Negative Phenotype, by Maggie C.U. Cheang, David Voduc, Chris Bajdik, et al. Clin Cancer Res 2008; 14:1368-1376)

3.1.2 Searching for our own biomarkers

The log2FoldChange was examined from the mean of all genes before and after treatment with lapatinib.
Treat<-colMeans(LapTreat, na.rm = TRUE)
Untreat<-colMeans(LapUn, na.rm = TRUE)
ut<-(Treat-Untreat)
ut_sort<-sort(ut, decreasing = TRUE)
head(ut_sort)
##     GDF15     NUPR1     DDIT3     TRIB3      ATF3      ASNS 
## 0.7150148 0.6843715 0.5794749 0.5753359 0.5603209 0.5305063

GDF3 Growth Differentiation Factor 3 belongs to the transforming growth factor beta (TGF-beta) superfamily and has some intrinsic activity.

NUPR1 stands for nuclear protein, transcriptional regulator, 1. It regulates the transcription during stress signals by empowering cells with resistance to stress.

DDIT3 encodes for the DNA damage-inducible transcript 3 protein, a member of the CCAAT/enhancer-binding protein which is a transcription factor. It is important for the response of cell stresses. The DNA damage-inducible transcript 3 protein induces cell cycle arrest and apoptosis in response to ER stress.

TRIB3 stands for Tribbles Pseudokinase 3. The kinase is induced by the transcription factor NF-kappaB. It is essential for cell proliferation.

ATF3 stands for activating transcription factor 3. It belongs to the mammalian activation transcription factor/cAMP responsive element-binding (CREB). ATF-3 is induced by physiological stress.

ASNS stands for Asparagine Synthetase: The enzyme catalyzes the production of the amino acid L-asparagine.

3.1.3 Boxplot of our biomarkers

boxplot(LapTreat$GDF15,LapUn$GDF15,
        LapTreat$NUPR1, LapUn$NUPR1, 
        LapTreat$DDIT3, LapUn$DDIT3, 
        LapTreat$TRIB3, LapUn$TRIB3,
        LapTreat$ATF3,LapUn$ATF3, 
        LapTreat$ASNS, LapUn$ASNS, 
        col=c("deeppink4", "deeppink4", "burlywood1", "burlywood1", "darkcyan", "darkcyan", "pink", "pink", "darkolivegreen3", "darkolivegreen3", "lightblue", "lightblue"),
        names = c("GDF15 before", "GDF15 after", "NUPR1 before", "NUPR1 after ", "DDIT3 before", "DDIT3 after" ,"TRIB3 before","TRIB3 after", "ATF3 before", "ATF3 after", "ASNS before",  "ASNS after"),
        main="Expression of Biomakers before and after treatment with Lapatinib", 
        xlab="Biomarkers",
        ylab="Expression")

breastcells<-subset(Metadata, tissue == "Breast")
breast<-subset(CCLE_basalexpression, breastcells$cell  %in%  colnames(CCLE_basalexpression))
tbreastcells<-data.frame(t(breast))
breast_marker <- c(tbreastcells$GDF15,tbreastcells$NUPR1, tbreastcells$DDIT3, tbreastcells$TRIB3, tbreastcells$ATF3, tbreastcells$ASNS)
breast_marker_matrix <- data.frame(tbreastcells$GDF15,tbreastcells$NUPR1, tbreastcells$DDIT3, tbreastcells$TRIB3, tbreastcells$ATF3, tbreastcells$ASNS)
names(breast_marker_matrix)[names(breast_marker_matrix) == "tbreastcells.GDF15"] <- "GDF15"
names(breast_marker_matrix)[names(breast_marker_matrix) == "tbreastcells.NUPR1"] <- "NUPR1"
names(breast_marker_matrix)[names(breast_marker_matrix) == "tbreastcells.DDIT3"] <- "DDIT3"
names(breast_marker_matrix)[names(breast_marker_matrix) == "tbreastcells.TRIB3"] <- "TRIB3"
names(breast_marker_matrix)[names(breast_marker_matrix) == "tbreastcells.ATF3"] <- "ATF3"
names(breast_marker_matrix)[names(breast_marker_matrix) == "tbreastcells.ASNS"] <- "ASNS"
Biomarkers<-c(rep('DF15',45),rep('NUPR1',45),rep('DDIT3',45),rep('TRIB3',45),rep('ATF3',45),rep('ASNS',45))
Expression<-c(tbreastcells$GDF15,tbreastcells$NUPR1, tbreastcells$DDIT3, tbreastcells$TRIB3, tbreastcells$ATF3, tbreastcells$ASNS)
df_breast<-data.frame(Expression, Biomarkers)

3.1.4 ggBoxplot of our biomarkers’ basalexpression in breast cells with Kruskal-Wallis test

The null hypothesis is: there is no difference between the groups. The Kruskal-Wallis test is for independent samples. It tests whether the central trends of several independent samples differ.
ggboxplot(df_breast, x="Biomarkers", y="Expression", color = "Biomarkers",
          add = "jitter", legend = "none")+
  rotate_x_text(angle = 45)+
  stat_compare_means(method = "kruskal.test", label.y = 15)+   # Add global kurskal wallis test p-value
  stat_compare_means(label = "p.signif", method = "t.test",
                     ref.group = ".all.", hide.ns = TRUE)      # Pairwise comparison against all

3.1.5 Heatmap of biomarkers in breast cells

pheat_breast<-pheatmap(breast_marker_matrix ,
                      show_rownames = FALSE, 
                      show_colnames =TRUE,
                      cutree_cols = 4, 
                      cutree_rows = 3, 
                      color = viridis(4),
                      drop_levels = TRUE, 
                      clustering_method = "ward.D2", 
                      scale="row")
pheat_breast

3.2. Paired t-test between the Lapatinib treated und untreated data

3.2.1 Genome wide paired t-test

t.test(before, after, paired=TRUE)
## 
##  Paired t-test
## 
## data:  before and after
## t = -1.5497e-13, df = 718140, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.0003680021  0.0003680021
## sample estimates:
## mean of the differences 
##           -2.909656e-17

3.2.2 Paired t-test of our biomarker before and after treatment

The paired t-test shows a statistical significance of our biomarkers.
before_biomarker<-LapUn %>% dplyr::select("GDF15", "NUPR1", "DDIT3", "TRIB3", "ATF3", "ASNS")
after_biomarker<-LapTreat %>% dplyr::select("GDF15", "NUPR1", "DDIT3", "TRIB3", "ATF3", "ASNS")
table<-data.table(col_t_paired(before_biomarker, after_biomarker, alternative = "two.sided", mu = 0,conf.level = 0.95))
Marker=c("GDF15", "NUPR1", "DDIT3", "TRIB3", "ATF3", "ASNS")
pvalue<-table$pvalue
table[, .(Marker,pvalue)]
##    Marker       pvalue
## 1:  GDF15 2.612447e-12
## 2:  NUPR1 1.327187e-11
## 3:  DDIT3 9.979311e-20
## 4:  TRIB3 6.995573e-20
## 5:   ATF3 2.142510e-12
## 6:   ASNS 2.207251e-14

3.2.3 Paired t-test over genes, increased confidence level (99 %)

row_t_test<-as.data.frame(row_t_paired(before, after, alternative = "two.sided", mu = 0,conf.level = 0.99))
summary(row_t_test$pvalue)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000000 0.0002317 0.0756336 0.2421297 0.4428285 0.9998899

3.2.4 Visualization of the genome-wide paired t-test in a vulcano plot

Lap_t_test<-data.table(col_t_paired(LapUn, LapTreat, alternative = "two.sided", mu = 0,conf.level = 0.95))
padj<-p.adjust(Lap_t_test$pvalue, "BH")

Lapatinib_fc<-as.data.frame.numeric(colMeans(LapUn-LapTreat))
Lgenes<-(colnames(LapUn))

vulcano<-data.frame(genes=Lgenes,Fold=Lapatinib_fc, pvalue=padj)
names(vulcano)[names(vulcano) == "colMeans.LapUn...LapTreat."] <- "log2FoldChange"
with(vulcano, plot(log2FoldChange, -log10(pvalue), pch=20, main="Volcano plot", xlim=c(-2.5,2)))

# Add colored points: red if padj<0.05, blue if log2FC>1, green if both)
with(subset(vulcano, padj<.05 ), points(log2FoldChange, -log10(pvalue), pch=20, col="red"))
with(subset(vulcano, abs(log2FoldChange)>1), points(log2FoldChange, -log10(pvalue), pch=20, col="blue"))
with(subset(vulcano, padj<.05 & abs(log2FoldChange)>1), points(log2FoldChange, -log10(pvalue), pch=20, col="green"))

# Label points with the textxy function from the calibrate plot
with(subset(vulcano, padj<.05 & abs(log2FoldChange)>1), textxy(log2FoldChange, -log10(pvalue), labs = genes, cex=.8))

3.3 K-means

LapatinibFold = select(Fold_Change, contains("Lapa"))

#Determining the number of clusters
topVarFold = apply(LapatinibFold, 1, var)
summary(topVarFold)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0007514 0.0090894 0.0138573 0.0197222 0.0226351 0.4897715
# Using the most variable, thus informative genes
topVarFold75 = LapatinibFold[topVarFold > quantile(topVarFold, probs = 0.75), ]
dim(topVarFold75)
## [1] 3325   54
km = kmeans(x = t(topVarFold75), centers = 2, nstart = 10)
km$tot.withinss
## [1] 6627.055
km = kmeans(x = t(topVarFold75), centers = 3, nstart = 10)
km$tot.withinss
## [1] 6159.495
#running a loop for the best n (searching for "ellbow")
wss = (sapply(2:7, function(k) {
  kmeans(x =t(topVarFold75), centers = k)$tot.withinss})  )            
plot(2:7, wss, type = "b", pch = 19, xlab = "Number of clusters K", ylab = "Total within-clusters sum of squares", main = "Determining the amount of clusters from Foldchange")

# Using the silhouett method
D = dist(t(topVarFold75))
km = kmeans(x = t(topVarFold75), centers = 3, nstart = 10)
s = silhouette(km$cluster, D)
plot(s)

3.4 PCA

Since the clustering method by performing k means clustering again was not successful, we performed a principal component analysis over the selected lapatinib samples for identifying clusters or patterns while preserving as much information of the original data set as possible.

L_fc <- select(Fold_Change, contains("Lapa"))

# PCA
pca <- prcomp(t(L_fc), scale = TRUE)

By plotting a scree plot of the variation percentage, we can display how much variation a principal component captures from the original data.

pca.var <- pca$sdev^2  # sdev calculates variation each PC accounts for
pca.var.per <- round(pca.var/sum(pca.var)*100, 1) 
# since percentages make more sense then normal variation values
# calculate % or variation, which is much more interesing


plot(pca.var.per[1:10], main = "Scree plot", type = "l", xlab = "Principal Components", ylab = "% variation")

plot(cumsum(pca.var.per[1:15]), main = "cumulative variation", type = "l", xlab = "Principal Components", ylab = "% variation") 

## get names of top 10 genes that contribute most to pc1
loading_scores_1 <- pca$rotation[,1]

gene_score <- abs(loading_scores_1) ## sort magnitude
gene_score_ranked <- sort(gene_score, decreasing = TRUE)

top_10_genes <- names(gene_score_ranked[1:10])
top_10_genes # show names of top 10 genes
##  [1] "MIR3658///UCK2"           "EIF2S1"                  
##  [3] "NOP16"                    "UCHL5"                   
##  [5] "HSPD1"                    "ITM2B"                   
##  [7] "MIR664B///SNORA56///DKC1" "KLHDC2"                  
##  [9] "NXT1"                     "PNO1"

To determine clusters of samples based on their similarity, we plotted PC1 and PC2 aswell as a PC2 and PC3 while colouring by tissue type and drug respectively.

par(mfrow=c(2,2), cex=0.7, oma=c(2,3,2,2))
par(mar=c(4, 6, 3, 3))
## colored by msi
#plot PC1 and PC2
msi12 <- plot(pca$x[,1], 
          pca$x[,2], 
          col = color_msi,
          pch = 19, 
          xlab = paste("PC1 (",pca.var.per[1],"%)"), 
          ylab = paste("PC2 (",pca.var.per[2],"%)"))

#create legend
legend(450, 0, 
       legend = msi, 
       col = colormsi, 
       pch = 19, 
       xpd = "FALSE",
       bty = "n",
       cex = 0.75
)
#create title
mtext("PCA of Fold Change colored by MSI", 
      side = 3, 
      line = -2,
      cex = 1.2,
      font = 2,
      outer = TRUE)


#plot PC2 and PC3
msi23 <- plot(pca$x[,2], 
          pca$x[,3], 
          col = color_msi,
          pch = 19, 
          xlab = paste("PC2 (",pca.var.per[2],"%)"), 
          ylab = paste("PC3 (",pca.var.per[3],"%)"))
    



##colored by tissue
#plot PC1 and PC2
tissue12 <- plot(pca$x[,1], 
              pca$x[,2], 
              col = color_tissue,
              pch = 19, 
              xlab = paste("PC1 (",pca.var.per[1],"%)"), 
              ylab = paste("PC2 (",pca.var.per[2],"%)"))
             
#create legend
legend(450, 0, 
       legend = tissue, 
       col = magma, 
       pch = 19, 
       xpd = "TRUE",
       bty = "n",
       cex = 0.75
)
#create title
mtext("PCA of Fold Change colored by tissue", 
      side = 3, 
      line = -16,
      cex = 1.2,
      font = 2,
      outer = TRUE)

#plot PC2 and PC3
tissue23 <- plot(pca$x[,2], 
              pca$x[,3], 
              col = color_tissue,
              pch = 19, 
              xlab = paste("PC2 (",pca.var.per[2],"%)"), 
              ylab = paste("PC3 (",pca.var.per[3],"%)"))

As we can see, rarely any clusters can be determined. In Addition, the colour annotation does not reveal any pattern of clustering.

3.5 Spearman correlation

Modelling of a linear relationship between cell doubling time and inoculation density.

# Spearman correlation
cor(Cellline_Annotation$Doubling_Time, Cellline_Annotation$Inoculation_Density, method = "spearman")
## [1] 0.5674381
plot(Cellline_Annotation$Doubling_Time, Cellline_Annotation$Inoculation_Density, pch= 16, col= "blue", main = "Spearman correlation between Doubling Time and Inoculation Density", xlab = "Doubling Time", ylab = "Inoculation Density")
lm(Cellline_Annotation$Inoculation_Density~ Cellline_Annotation$Doubling_Time)
## 
## Call:
## lm(formula = Cellline_Annotation$Inoculation_Density ~ Cellline_Annotation$Doubling_Time)
## 
## Coefficients:
##                       (Intercept)  Cellline_Annotation$Doubling_Time  
##                            9110.9                              169.4
abline(lm(Cellline_Annotation$Inoculation_Density ~ Cellline_Annotation$Doubling_Time), col = "red", lwd = 2)

As you can see from the plot, there is a positive correlation between inoculation density and doubling time. This means that the higher the inoculation density, the higher the doubling time, which suggests that the tumour grows slowly because the cells need more time to divide.

4. Main questions

4.1 Question 1: Does the cell doubling time correlate with reduced drug sensitivity?

As mentioned in our presentation, we want to create a model to predict GI50-values and thus to predict, if Lapatinib is a good choice. Therefore, we took several datasets and performed 3 regression analyses.

4.1.1 Linear regression ~ Inoculation_Density

The first linear model tries to predict the G-50 value under the data of the Inoculation density.

4.1.2 Linear regression ~ Doubling-time

The second linear model tries to predict the G-50 value under the data of the Doubling-time.

4.1.3 Multiple regression

Finally, we did a multiple regression with both datasets to predict GI50-values.

4.1.4 Table of conclusion

The following table shows important values.

Inoculation_Density Doublingtime Multiple regression
R-squared-value 0.0013505 0.0913649 0.0044753
F-statistic 0.0622076 4.6253841 0.0921552
RMSE of test-dataset 0.5192093 0.6574516 0.7987091

The data is hard to discuss, because we used random values for our test and training dataset. Hence, different values on every run are obtained. We could use “selected values” (First 20) but this would be kind of cheating.

4.2 Question 2: Does Lapatinib have a similar effect on lung cancer as Erlotinib?

As we read in the paper “Antitumor and antiangiogenic effect of the dual EGFR and HER-2 tyrosine kinase inhibitor lapatinib in a lung cancer model. (Diaz et al., 2010)”, lapatinib and erlotinib are said to have similar effects. To verify this, we first looked at the correlation of the drugs and then created a graph showing the difference between GI50 for a certain cell line and the middle GI50 for both drugs.

4.2.1 Heatmap

cor = cor(NLGI50.all)
pheatmap(cor, col = cm.colors(256), main = "Correlation NLGI50")

Erlotinib and Lapatinib have a strong correlation.

4.2.2 Plot Erlotinib and Lapatinib, coloured by tissue

#differece
diff = data.frame(erlotinib = NLGI50.all$erlotinib - mean(NLGI50.all$erlotinib), lapatinib = NLGI50.all$lapatinib- mean(NLGI50.all$lapatinib))
diff$celllines = rownames(NLGI50.all)
#create vector to insert column tissue from Metadata

tissue = sapply(1:nrow(diff), function(x) {
  position = which(as.character(Metadata$cell) == diff[x, "celllines"])[1] #if tissue occurs several times, take the first
  out = as.character(Metadata[position, "tissue"]) #output the tissue at this position
  return(out)
})
diff$tissue = tissue
rm(tissue)

diff$celllines = factor(diff$celllines, levels = diff$celllines[order(diff$tissue)]) #Classified by tissue

e =  ggplot(diff, aes(x = celllines, y = erlotinib, fill = tissue))+ 
  geom_bar(stat = "identity") + 
  coord_flip() + 
  labs(title = "Mean graph plot of NLGI50 values for Erlotinib")+
  theme(plot.title = element_text(hjust = 0.5, size = 19),
        axis.title = element_text(size=17))

l = ggplot(diff, aes(x = celllines, y = lapatinib, fill = tissue)) + 
  geom_bar(stat="identity") + 
  coord_flip() + 
  labs(title="Mean graph plot of NLGI50 values for Lapatinib")+
  theme(plot.title = element_text(hjust = 0.5, size = 19),
        axis.title = element_text(size=17)) #change size and center title

ggarrange(e, l,
          labels = c("A", "B"),
          ncol = 2, nrow = 1)

rm(e,l)

The difference from the NegLogGI50 for a particular cell line and the mean NegLogGI50 is plotted here for Erlotinib and Lapatinib.

4.2.3 Pearson correlation

cor(NLGI50.all$erlotinib, NLGI50.all$lapatinib, method = "pearson")
## [1] 0.6528188

A Pearson correlation coefficient of ~ 0.65 confirms that these patterns are similar. One reason for this is that both erlotinib and lapatinib are EGFR inhibitors. Cell lines that were more sensitive are displayed as bars that project to the right of the mean. Cell lines that were less sensitive are displayed with bars projected to the left.

4.2.4 Lung cellines

To answer our more specific question, whether lapatinib also has an effect on lung cancer, we will only look at the cell lines in lung cancer.

#only lung with mean all
### load data
Metadata_Lapatinib_treated = Metadata[which(Metadata$drug == "lapatinib" & Metadata$dose != "0nM"),]
NegLogGI50 = as.data.frame(readRDS(paste0(wd, "/Data/NegLogGI50.rds")))
#lung cells from Metadata
Lung_Metadata_L_treated = Metadata[which(Metadata$drug == "lapatinib" & Metadata$dose != "0nM" & Metadata$tissue == "Lung"),] 
celllines = Lung_Metadata_L_treated$cell
NegLogGI50.lung = as.data.frame(t(NegLogGI50[c("erlotinib", "lapatinib"), celllines]))

#Difference
dif.NegLogGI50.lung = data.frame(erlotinib = NegLogGI50.lung$erlotinib -  mean(NLGI50.all$erlotinib), lapatinib = NegLogGI50.lung$lapatinib -  mean(NLGI50.all$lapatinib)) #erlotinib data - mean value, lapatinib data - mean value
dif.NegLogGI50.lung$celllines = rownames(NegLogGI50.lung)

# PLot Erlotinib and Lapatinib

e = ggplot(dif.NegLogGI50.lung,aes(x = celllines, y = erlotinib)) + geom_bar(stat = "identity", fill = "skyblue") + geom_text(aes(label=round(erlotinib, 2)), vjust = -0.5, color = "black", size = 3) + coord_flip() + labs(title = "Mean graph plot of NLGI50 values for Erlotinib, only lung genes")+
  theme(plot.title = element_text(size = 15),
        axis.title = element_text(size=15))

l = ggplot(dif.NegLogGI50.lung,aes(x = celllines, y = lapatinib)) + geom_bar(stat = "identity", fill = "skyblue") + geom_text(aes(label=round(lapatinib, 2)), vjust = -0.5, color = "black", size = 3) + coord_flip() + labs(title = "Mean graph plot of NLGI50 values for Lapatinib, only lung genes")+
  theme(plot.title = element_text(size = 15),
        axis.title = element_text(size=15))

ggarrange(e, l,
          labels = c("A", "B"),
          ncol = 2, nrow = 1)

rm(e,l)

The difference from the NegLogGI50 for cell line from lung tissue and the mean NegLogGI50 is plotted here for Erlotinib and Lapatinib.

#correlation lung

cor(NegLogGI50.lung$erlotinib, NegLogGI50.lung$lapatinib, method = "pearson")
## [1] 0.9609488

A pearson correlation coefficent of ~ 0.96 suggests that Lapatinib has a similar effect on lung cancer as Erlotinib.

Selection of Lapatinib and Erlotinib treated cells

lapa<-data.frame(Metadata[which(Metadata[,'drug'] == "lapatinib"), ])
erlo<-data.frame(Metadata[which(Metadata[,'drug'] == "erlotinib"), ])
el<-right_join(lapa,erlo, by="cell")
rmv.rows = apply(el, 1, function(x) {
  sum(is.na(x))
})  # Go through each row and sum up all missing values
row.names(rmv.rows)
Create data frame with Lapatinib and Erlotinib data
fc<-data.frame(Treated-Untreated)
all<-data.frame(fc[grep("lapatinib|erlotinib", colnames(fc))])


all.rmv<-data.frame(all %>% dplyr::select(
                    -"CCRF.CEM_erlotinib_10000nM_24h", 
                    -"HL.60_erlotinib_10000nM_24h", 
                    -"HT29_erlotinib_10000nM_24h", 
                    -"K.562_erlotinib_10000nM_24h", 
                    -"LOX_erlotinib_10000nM_24h",
                    -"SR_erlotinib_10000nM_24h",
                    -"COLO.205_lapatinib_10000nM_24h"))
                  
la<-data.frame(all.rmv[grep("lapatinib", colnames(all.rmv))])

er<-data.frame(all.rmv[grep("erlotinib", colnames(all.rmv))])
erla<-data.frame(er,la)
Drug<-c(rep('Erlotinib',53), rep('Lapatinib',53))
log2FoldChange<-apply(erla, MARGIN = 2, FUN = mean)
df_drug<-data.frame(log2FoldChange, Drug)

4.2.5 Anova

Checking requirements for ANOVA

First, the fold change data of lapatinib and erlotinib must be checked for normality. Anova can be used if the data is distributed normally.
qqnorm(log2FoldChange, pch = 1, frame = FALSE)
qqline(log2FoldChange, col = "steelblue", lwd = 2)
Anova anticipates homogeneity of variances. If the p-value of the Fligner-Killeen test is greater than 0.05, the homogeneity of variance is not rejected and this condition is met. The p-value of 0,08067 let us know that variances are homogeneous.
fligner.test(log2FoldChange ~ Drug, data=df_drug )
## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  log2FoldChange by Drug
## Fligner-Killeen:med chi-squared = 0.75387, df = 1, p-value =
## 0.3853
Anova tests whether the means of several independent groups differ. A p-value of 0.53 means that the result is not significant. Therefore, there is no significant difference between Erlotinib and Lapatinib.
ggboxplot (data = df_drug, x="Drug", y="log2FoldChange", color = "Drug",    
          add = "jitter", legend = "none")+
          rotate_x_text(angle = 45)+
          stat_compare_means(method = "anova",  label.x = 2)+ # Add global anova p-value
          stat_compare_means(label = "p.signif", method = "t.test",
          ref.group = ".all.", hide.ns = TRUE)      # Pairwise comparison against all

4.3 Question 3: Does Lapatinib have a similar effect on brain metastasis as it does on breast cancer?

As a proven fact, lapatinb crosses the brain-blood barrier. Since breast cancer tends to spread and form metastasis in brain cells and lapatinib is used in anti-breast cancer therapy, we wanted to analyse our data for similar effects after lapatinib treatment in breast and cns tissue cells.

At first, we selected lapatinib response genes by creating matrices with cns- and breast-cell line fold change values.

L_fc <- select(Fold_Change, contains("Lapa"))
L_fc <- as.data.frame(t(L_fc))
rownames(Metadata) <- Metadata$sample


L_treated <- select(Treated, contains("Lapa"))
L_treated <- t(L_treated)
L_untreated <- select(Untreated, contains("Lapa"))
L_untreated <- t(L_untreated)



# selecting breast Lapatinib samples
breast <- Metadata[Metadata[,'tissue']=="Breast",]
rownames(breast) <- breast$sample
rownames(breast) <- gsub(x = rownames(breast), pattern = "-", replacement = ".")  

breastFC <- subset(L_fc, rownames(L_fc) %in% rownames(breast))
breastTreated <- subset(L_treated, rownames(L_treated) %in% rownames(breast))
breastUntreated <- subset(L_untreated, rownames(L_untreated) %in% rownames(breast))#


# selecting CNS Lapatinib samples
cns <- Metadata[Metadata[,'tissue']=="CNS",]
rownames(cns) <- cns$sample
rownames(cns) <- gsub(x = rownames(cns), pattern = "-", replacement = ".")

cnsFC <- subset(L_fc, rownames(L_fc) %in% rownames(cns))
cnsTreated <- subset(L_treated, rownames(L_treated) %in% rownames(cns))
cnsUntreated <- subset(L_untreated, rownames(L_untreated) %in% rownames(cns))

Next, we performed a paired t-test of treated and untreated cns and breast samples, respectively, to determine statistical significant fold change values. Since we are performing multiple testing over different cell lines from breast and cns tissues, we want to adjust our p value by performing a Benjamini Hochberg correction.

#performing a paired t-test of treated and untreated samples
t_test_cns <- col_t_paired(cnsTreated, cnsUntreated, alternative = "two.sided", mu = 0,conf.level = 0.95)
t_test_breast <- col_t_paired(breastTreated, breastUntreated, alternative = "two.sided", mu = 0,conf.level = 0.95)

#obtaining Benjamini-Hochberg adjusted p-values
pval_cns <- t_test_cns$pvalue
pval_breast <- t_test_breast$pvalue

fdr_cns <- p.adjust(pval_cns, "BH")
fdr_breast <- p.adjust(pval_breast, "BH")


#obtaining mean FC values over all samples 
breastFCm <- as.numeric(colMeans(breastFC))
cnsFCm <- as.numeric(colMeans(cnsFC))
genes <- colnames(breastFC)

4.3.1 Volcano plot

For visualization of the biological significance (fold change values) and the statistical significance (adjusted p-values), we plotted our data in an interactive volcano plot.

## breast volcano plot
#creating a matrix containg all needed values for plotting
diff_df_breast <- data.frame(gene = genes, Fold = breastFCm, FDR = fdr_breast)
diff_df_breast$absFold <- abs(diff_df_breast$Fold)
head(diff_df_breast)
##     gene         Fold       FDR     absFold
## 1   A1CF  0.037268413 0.8765540 0.037268413
## 2    A2M -0.032213825 0.7188608 0.032213825
## 3 A4GALT  0.006012452 0.9793436 0.006012452
## 4  A4GNT -0.053969518 0.4235638 0.053969518
## 5   AAAS  0.081656784 0.5283372 0.081656784
## 6   AACS  0.023767096 0.8115022 0.023767096
# add a grouping column; default value is "not significant"
diff_df_breast$group <- "NotSignificant"



# change the grouping for the entries with significance but not a large enough Fold change
diff_df_breast[which(diff_df_breast['FDR'] < 0.5 & (diff_df_breast['absFold']) < 0.5 ),"group"] <- "Significant"

# change the grouping for the entries a large enough Fold change but not a low enough p value
diff_df_breast[which(diff_df_breast['FDR'] > 0.5 & (diff_df_breast['absFold']) > 0.5 ),"group"] <- "FoldChange"

# change the grouping for the entries with both significance and large enough fold change
diff_df_breast[which(diff_df_breast['FDR'] < 0.5 & (diff_df_breast['absFold']) > 0.5 ),"group"] <- "Significant&FoldChange"


# Find and label the top peaks.
top_peaks_breast <- diff_df_breast[with(diff_df_breast, order(Fold, FDR)),][1:10,]
top_peaks_breast <- rbind(top_peaks_breast, diff_df_breast[with(diff_df_breast, order(-Fold, FDR)),][1:10,])


# Add gene labels for all of the top genes we found
# creating an empty list, and filling it with entries for each row in the dataframe
# each list entry is another list with named items that will be used 
a <- list()
for (i in seq_len(nrow(top_peaks_breast))) {
  m <- top_peaks_breast[i, ]
  a[[i]] <- list(
    x = m[["Fold"]],
    y = -log10(m[["FDR"]]),
    text = m[["gene"]],
    xref = "x",
    yref = "y",
    showarrow = TRUE,
    arrowhead = 0.5,
    ax = 20,
    ay = -40
  )
}

plot_breast <- plot_ly(data = diff_df_breast, x = diff_df_breast$Fold, y = -log10(diff_df_breast$FDR), type = "scatter", text = diff_df_breast$gene, mode = "markers", color = diff_df_breast$group) %>% 
  layout(title ="Volcano Plot of Lapatinib breast cancer samples", 
         xaxis = list(title="log2 Fold Change"),
         yaxis = list(title="FDR")) %>%
  layout(annotations = a)
plot_breast
###thresholds still need to be discussed
## CNS volcano plot

diff_df_cns <- data.frame(gene = genes, Fold = cnsFCm, FDR = fdr_cns)
diff_df_cns$absFold <- abs(diff_df_cns$Fold)
head(diff_df_cns)
##     gene         Fold       FDR     absFold
## 1   A1CF  0.066575311 0.5939566 0.066575311
## 2    A2M  0.038348381 0.6873009 0.038348381
## 3 A4GALT  0.000390011 0.9980719 0.000390011
## 4  A4GNT -0.018219799 0.8780106 0.018219799
## 5   AAAS  0.014723327 0.9008420 0.014723327
## 6   AACS  0.003887384 0.9870209 0.003887384
# add a grouping column; default value is "not significant"
diff_df_cns$group <- "NotSignificant"


# change the grouping for the entries with significance but not a large enough Fold change
diff_df_cns[which(diff_df_cns['FDR'] < 0.5 & (diff_df_cns['absFold']) < 0.5 ),"group"] <- "Significant"

# change the grouping for the entries a large enough Fold change but not a low enough p value
diff_df_cns[which(diff_df_cns['FDR'] > 0.5 & (diff_df_cns['absFold']) > 0.5 ),"group"] <- "FoldChange"

# change the grouping for the entries with both significance and large enough fold change
diff_df_cns[which(diff_df_cns['FDR'] < 0.5 & (diff_df_cns['absFold']) > 0.5 ),"group"] <- "Significant&FoldChange"


# Find and label the top peaks..
top_peaks_cns <- diff_df_cns[with(diff_df_cns, order(Fold, FDR)),][1:10,]
top_peaks_cns <- rbind(top_peaks_cns, diff_df_cns[with(diff_df_cns, order(-Fold, FDR)),][1:10,])


a <- list()
for (i in seq_len(nrow(top_peaks_cns))) {
  m <- top_peaks_cns[i, ]
  a[[i]] <- list(
    x = m[["Fold"]],
    y = -log10(m[["FDR"]]),
    text = m[["gene"]],
    xref = "x",
    yref = "y",
    showarrow = TRUE,
    arrowhead = 0.5,
    ax = 20,
    ay = -40
  )
}

plot_cns <- plot_ly(data = diff_df_cns, x = diff_df_cns$Fold, y = -log10(diff_df_cns$FDR),type = "scatter", text = diff_df_cns$gene, mode = "markers", color = diff_df_cns$group) %>% 
  layout(title ="Volcano Plot of Lapatinib CNS cancer samples",
         xaxis = list(title="log2 Fold Change"),
         yaxis = list(title="FDR"))%>%
  layout(annotations = a)
plot_cns

4.3.2 Heatmap for visualization

To visualize the expression patterns of the top peak genes present in both tissue types we calculated a heat map, comparing the expression profiles of the two tissue types.

# selecet top peak genes common in cns and breast tissue
tpb_comparison <- subset(top_peaks_breast, gene %in% top_peaks_cns$gene)
tpc_comparison <- subset(top_peaks_cns, gene %in% top_peaks_breast$gene)


# order common genes alphabetically
tpb_comparison <- tpb_comparison[order(tpb_comparison$gene),]
tpc_comparison <- tpc_comparison[order(tpc_comparison$gene),]


## creating heat map of FCs to compare values 
cor_mat <- cbind("breast" = tpb_comparison$Fold, "cns" = tpc_comparison$Fold)
rownames(cor_mat) <- tpb_comparison$gene
data <- read.delim


pheatmap(
  mat               = cor_mat,
  color             = magma(10),
  border_color      = "black",
  show_colnames     = TRUE,
  show_rownames     = TRUE,
  drop_levels       = TRUE,
  fontsize          = 14,
  main              = "Comparison:
  FC levels of cns and breast top peak genes"
)

4.3.3 Investigation of correlation

For better visualization of the expression patterns of the top peak genes among breast and cns tissues we computed a heat map of the mean fold changes for direct comparison.

#correlation between top peak gene expression patterns in breast and cns tissues treated by lapatinib
cor_mat <- as.data.frame(cor_mat)
dif.FC.BC = data.frame(breast = cor_mat$breast -  mean(t(breastFC)), cns = cor_mat$cns -  mean(t(cnsFC))) #breast data - mean value, cns data - mean value
dif.FC.BC$gene = rownames(cor_mat)

To gather a more detailed conception of the gene expression regulation values (aka the fold change) we also computed a comparing bar plot showing the exact fold change values.

# PLot

b = ggplot(dif.FC.BC,aes(x = gene, y = breast)) +
  geom_bar(stat = "identity", fill = "skyblue") + 
  geom_text(aes(label = round(breast, 2)), vjust = -0.5, color = "black", size = 3) + 
  coord_flip() + labs(title = "Mean of Fold Change for breast top peak genes")+
  theme(plot.title = element_text(hjust = 0.5,size = 15),
        axis.title = element_text(size = 15))


c = ggplot(dif.FC.BC,aes(x = gene, y = cns)) +
  geom_bar(stat = "identity", fill = "skyblue") + 
  geom_text(aes(label = round(cns, 2)), vjust = -0.5, color = "black", size = 3) + 
  coord_flip() + labs(title = "Mean of Fold Change for CNS top peak genes")+
  theme(plot.title = element_text(hjust = 0.5, size = 15),
        axis.title = element_text(size = 15))

ggarrange(b, c,
          labels = c("A", "B"),
          ncol = 2, nrow = 1)

rm(b,c)

At last, we investigated the expression regulation in both tissue types by calculating the pearson correlation between the fold change values of the top peak genes.

#correlation
cor(cor_mat$breast, cor_mat$cns, method = "pearson")
## [1] 0.921812

The obtained correlation value of ~ 0.92 indicates that lapatinib treatment yields a similar effect on gene expression in breast and cns tissue cells and therefore might be possible to treat cancer in brain metastasis.